Clustering With Constraints Using Graph Based Approach

Authors

  • Haytham Elghazel
  • Khalid Benabdeslem
  • Alain Dussauchoy
Abstract

Clustering is often regarded as the most important unsupervised learning problem: it seeks structure in a collection of unlabeled data by organizing objects into groups whose members are similar to one another and dissimilar to members of other groups [1]. Although this process is usually carried out in an entirely unsupervised manner, additional background information (namely constraints) is available in some domains and must be taken into account in the clustering solution. Such constraints vary with the user and the domain, but we are mainly interested in background information expressed as instance-level must-link and cannot-link constraints. A must-link constraint requires that two instances be placed in the same cluster, while a cannot-link constraint requires that two instances not be placed in the same cluster. Enforcing these constraints requires modifying the clustering algorithm, which is not always feasible. Many authors have investigated the use of constraints in clustering. In [2], the authors proposed a modified version of the COBWEB clustering algorithm that uses background information about pairs of instances to constrain their cluster placement. Similarly, a more recent work [3] extended the ubiquitous k-Means algorithm to incorporate the same types of instance-level hard constraints (must-link and cannot-link).

Recently, we proposed a new clustering approach [4] based on the concept of b-coloring of a graph [5]. It exhibits important clustering features and builds a fine partition of the data set (numeric or symbolic) into clusters when the number of clusters is not specified beforehand. A b-coloring of a graph is an assignment of colors (clusters) to the vertices of the graph such that (i) no two adjacent vertices have the same color (proper coloring), and (ii) for each color there exists at least one dominating vertex, that is, a vertex adjacent to at least one vertex of every other color. This dominating vertex reflects the properties of its class and also guarantees that the class is distinctly separated from all other classes of the partition. In this paper, we are interested in ways to integrate background information into the b-coloring based clustering algorithm. The proposed algorithm, which we refer to as COP-b-coloring (for constraint partitioning b-coloring), is evaluated on benchmark data sets, and the results indicate that instance-level hard constraints offer real benefits (in accuracy and runtime) for the clustering problem.
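
The two definitions used in the abstract (a b-coloring, and instance-level must-link and cannot-link constraints) can be made concrete with a small check. The Python sketch below is illustrative only and is not the COP-b-coloring algorithm itself: it only verifies a given coloring and does not construct one. The function names, the adjacency-set graph representation, and the four-vertex example are assumptions introduced here.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]   # vertex -> set of adjacent vertices (assumed representation)
Coloring = Dict[Hashable, int]          # vertex -> color, i.e. cluster label
Pair = Tuple[Hashable, Hashable]


def is_proper(graph: Graph, coloring: Coloring) -> bool:
    """Condition (i): no two adjacent vertices share a color."""
    return all(coloring[u] != coloring[v] for u in graph for v in graph[u])


def dominating_vertices(graph: Graph, coloring: Coloring) -> Dict[int, List[Hashable]]:
    """For each color, list the vertices that see every other color among their neighbors."""
    colors = set(coloring.values())
    dominating: Dict[int, List[Hashable]] = {c: [] for c in colors}
    for v, neighbors in graph.items():
        seen = {coloring[u] for u in neighbors}
        if colors - {coloring[v]} <= seen:
            dominating[coloring[v]].append(v)
    return dominating


def is_b_coloring(graph: Graph, coloring: Coloring) -> bool:
    """Conditions (i) and (ii): proper, and every color has a dominating vertex."""
    return is_proper(graph, coloring) and all(dominating_vertices(graph, coloring).values())


def violated_constraints(coloring: Coloring,
                         must_link: Iterable[Pair],
                         cannot_link: Iterable[Pair]) -> List[Tuple[str, Hashable, Hashable]]:
    """Return every must-link / cannot-link pair that the coloring breaks."""
    broken = []
    for a, b in must_link:
        if coloring[a] != coloring[b]:
            broken.append(("must-link", a, b))
    for a, b in cannot_link:
        if coloring[a] == coloring[b]:
            broken.append(("cannot-link", a, b))
    return broken


# Hypothetical example: a path a - b - c - d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
two_colors = {"a": 0, "b": 1, "c": 0, "d": 1}
three_colors = {"a": 0, "b": 1, "c": 2, "d": 0}

print(is_b_coloring(graph, two_colors))    # True
print(is_b_coloring(graph, three_colors))  # False: color 0 has no dominating vertex
print(violated_constraints(two_colors, [("a", "c")], [("a", "b")]))  # []
```

In the three-color case the coloring is proper, but neither vertex of color 0 is adjacent to both of the other colors, so condition (ii) fails. Note also that a proper coloring always separates adjacent vertices, so in this representation a must-link between two adjacent vertices can never be satisfied.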

Similar resources

Generating Optimal Timetabling for Lecturers using Hybrid Fuzzy and Clustering Algorithms

UCTTP is an NP-hard problem, which must be performed for each semester frequently. The major technique in the presented approach would be analyzing data to resolve uncertainties of lecturers’ preferences and constraints within a department in order to obtain a ranking for each lecturer based on their requirements within a department where it is attempted to increase their satisfaction and develo...

A Graph-Based Clustering Approach to Identify Cell Populations in Single-Cell RNA Sequencing Data

Introduction: The emergence of single-cell RNA-sequencing (scRNA-seq) technology has provided new information about the structure of cells, and provided data with very high resolution of the expression of different genes for each cell at a single time. One of the main uses of scRNA-seq is data clustering based on expressed genes, which sometimes leads to the detection of rare cell populations. ...

Optimal Self-healing of Smart Distribution Grids Based on Spanning Trees to Improve System Reliability

In this paper, a self-healing approach for smart distribution network is presented based on Graph theory and cut sets. In the proposed Graph theory based approach, the upstream grid and all the existing microgrids are modeled as a common node after fault occurrence. Thereafter, the maneuvering lines which are in the cut sets are selected as the recovery path for alternatives networks by making ...

Graph-Based Clustering with Constraints

A common way to add background knowledge to the clustering algorithms is by adding constraints. Though there had been some algorithms that incorporate constraints into the clustering process, not much focus was given to the topic of graph-based clustering with constraints. In this paper, we propose a constrained graph-based clustering method and argue that adding constraints in distance functio...

Clustering in WSN Based on Minimum Spanning Tree Using Divide and Conquer Approach

Due to heavy energy constraints in WSNs, clustering is an efficient way to manage the energy in sensors. There are many methods already proposed in the area of clustering and research is still going on to make clustering more energy efficient. In our paper we are proposing a minimum spanning tree based clustering using divide and conquer approach. The MST based clustering was first proposed in 1...

Publication date: 2007